Refactor: Decouple Core Logic into a Reusable Library #15

ayeganov · 2025-09-11T15:31:27Z

This PR refactors the project from a monolithic script into a reusable library. The core training, data handling, and tokenizer management logic has been extracted from train.py into decoupled, object-oriented components. The goal is a clean API that other projects can use and extend.

The train.py script is now a simple command-line client that demonstrates how to use the new library components.

Key Changes ✨

Trainer Class: A new scratchgpt/training/trainer.py module introduces the Trainer class, which now encapsulates all logic for training loops, validation, pre-tokenization, and model checkpointing.
DataSource Protocol: A new, flexible scratchgpt/data/datasource.py module defines a protocol for data loading. We've included concrete FileDataSource and FolderDataSource implementations, replacing the old TextProvider classes.
Refactored Tokenizer I/O: The get_tokenizer function in scratchgpt/model_io.py has been updated to use a factory pattern. This makes creating a default tokenizer more robust and explicit.
Modularized Tests: The single, large test file has been broken down into smaller, focused modules under the tests/ directory (test_tokenizer_io.py, tests/tokenizers/..), improving maintainability.
Upgraded CLI: The train.py script now accepts a --tokenizer argument that dynamically loads any tokenizer from the Hugging Face Hub, so the tokenizer is no longer fixed by the script.
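
To make the DataSource idea concrete for reviewers, here is a minimal sketch of what a protocol-based data abstraction could look like. It is an illustration only: the method set (`__iter__` yielding strings), the constructor arguments, and the `*.txt` default pattern are assumptions, not the actual contents of scratchgpt/data/datasource.py.

```python
from pathlib import Path
from typing import Iterator, Protocol, runtime_checkable


@runtime_checkable
class DataSource(Protocol):
    """Hypothetical protocol: anything that can yield text samples."""

    def __iter__(self) -> Iterator[str]: ...


class FileDataSource:
    """Yields the contents of a single text file as one sample (assumed behavior)."""

    def __init__(self, path: Path) -> None:
        self._path = Path(path)

    def __iter__(self) -> Iterator[str]:
        yield self._path.read_text(encoding="utf-8")


class FolderDataSource:
    """Yields one sample per matching file under a folder, in sorted path order."""

    def __init__(self, folder: Path, pattern: str = "*.txt") -> None:
        self._folder = Path(folder)
        self._pattern = pattern  # assumed default, for illustration only

    def __iter__(self) -> Iterator[str]:
        for path in sorted(self._folder.rglob(self._pattern)):
            if path.is_file():
                yield path.read_text(encoding="utf-8")


def count_characters(source: DataSource) -> int:
    """Example consumer: depends only on the protocol, not on a concrete class."""
    return sum(len(sample) for sample in source)
```

Because the consumer is typed against the protocol, a new source (for example, one backed by a database or a streaming dataset) would only need to implement `__iter__`, with no inheritance required.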

Highlights for Review 🔍

When reviewing, please pay special attention to:

The Trainer API: This is the new heart of the library. Is its interface clear? Does it correctly encapsulate the training logic?
The DataSource Protocol: This is our core data abstraction. Is it flexible enough for future use cases?
get_tokenizer Factory Pattern: Review the new signature in model_io.py. This is a key design pattern for how we manage object creation.
The New train.py: As the first client of our new library, does it demonstrate a clean and intuitive workflow?
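
For the get_tokenizer review point, the following sketch shows the general shape of a factory-based loader: reuse a saved tokenizer if one exists, otherwise call a caller-supplied factory. The `CharTokenizer` class, the `default_factory` parameter name, and the `tokenizer.json` file name are all hypothetical stand-ins so the example runs on its own; the real signature in model_io.py may differ.

```python
import json
from pathlib import Path
from typing import Callable


class CharTokenizer:
    """Toy character-level tokenizer, present only to make the sketch runnable."""

    def __init__(self, vocab: str) -> None:
        self.vocab = "".join(sorted(set(vocab)))
        self._to_id = {ch: i for i, ch in enumerate(self.vocab)}

    def encode(self, text: str) -> list[int]:
        return [self._to_id[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.vocab[i] for i in ids)

    def save(self, path: Path) -> None:
        path.write_text(json.dumps({"vocab": self.vocab}), encoding="utf-8")

    @classmethod
    def load(cls, path: Path) -> "CharTokenizer":
        return cls(json.loads(path.read_text(encoding="utf-8"))["vocab"])


def get_tokenizer(
    exp_path: Path,
    default_factory: Callable[[], CharTokenizer],
) -> CharTokenizer:
    """Load a saved tokenizer from exp_path if present, else build one via the factory."""
    saved = exp_path / "tokenizer.json"  # hypothetical file name
    if saved.exists():
        return CharTokenizer.load(saved)
    # The factory runs only when nothing was saved, so an expensive
    # construction step (or a download) is deferred until it is needed.
    return default_factory()
```

Passing a callable rather than a ready-made instance keeps the default explicit at the call site, e.g. `get_tokenizer(exp_dir, lambda: CharTokenizer("abc"))`, and avoids building a tokenizer that a saved one would immediately replace.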

scratchgpt/model/model.py

scratchgpt/model_io.py

scratchgpt/train.py

scratchgpt/training/trainer.py

first pass at creating trainer class

427fe1f

ayeganov requested a review from dariocazzani September 11, 2025 15:31

ayeganov self-assigned this Sep 11, 2025

ayeganov added 4 commits September 11, 2025 11:47

add progress bar to pretokenization

afaa3d9

make linters happy

10b981b

update gitignore

006da92

update readme

c5fc8ec